ABSTRACT


The prevalence of online education systems provides oppor- 
tunities to deliver personalized learning at scale. Educa- 
tional systems need to assess students so that they can pro- 
vide better curricula tailored to each student’s unique needs. 
Since there is a limited amount of time for quizzing a stu- 
dent, we need to test each student using those questions 
that capture the most information about their level of un- 
derstanding of various concepts. In this paper, we formally 
pose the problem and present multiple approaches for learn- 
ing a quizzing policy to determine a personalized sequence of 
questions for each student that best predicts their knowledge 
state. We first introduce simple heuristics including random 
selection and an uncertainty sampling approach inspired by 
an active learning framework. We then develop a reinforce- 
ment learning (RL) approach for designing a quizzing policy. 
Using simulations of students’ knowledge states, we provide 
initial evidence that an RL-based approach can improve over 
simple heuristics. We further demonstrate the effectiveness 
of our approaches using a real-world dataset consisting of 
over 1.5 million examples of students’ answers to mathe- 
matics questions from Eedi, an online educational platform. 
1. INTRODUCTION


Online education systems are making high-quality educa- 
tion more accessible for students across the globe. These 
systems provide various educational resources such as in- 
structional videos and exercises. To provide personalized 
curricula for improving the learning outcomes of students, 
an online education system needs to accurately infer each 
student’s knowledge state (i.e., their level of understanding 
of various concepts) by quizzing them. This is a challeng- 
ing task because the quizzing time is limited. To make the 
most efficient use of each student’s time, it is important to 
prioritize those questions that reveal the most information 
about the student’s knowledge.


We focus on a specific goal for student assessment: given a
limit on the number of questions we are allowed to ask each
student, how can we determine a sequence of questions for each
student that best predicts their knowledge state? Specifically,
when an education system needs to assess a student to infer
their knowledge, it selects a personalized question, poses it to
the student, and records the response. Based on the student’s
response history (i.e., a sequence of question-response pairs),
the system selects the next question, and it repeats this until
it has exhausted its query budget (i.e., the maximum number of
queries allowed). We refer to the function that provides the
next question to query based on a student’s response history
as a quizzing policy (QP).


We define the task of learning a QP in the context of the 
NeurIPS 2020 Education Challenge [27] launched by Eedi 
[6], an online educational platform with thousands of ac- 
tive users daily around the globe. We consider a set of 948 
multiple-choice mathematics questions that correspond to 
57 different concepts. Specifically, the task is to obtain a 
limited set of answers from each student for inferring the 
student’s knowledge on the 57 concepts and then predict 
the student’s performance on unseen questions based on the 
inferred knowledge state. 


The key challenge in designing a QP is related to a cru- 
cial task in machine learning: active learning (AL). For 
many learning tasks (e.g., image classification, text classi- 
fication), obtaining sufficient labeled data for training high- 
performance models is costly [16, 18, 32]. AL aims to reduce 
the amount of annotated data needed by having the model 
carefully select which data points should be labeled. 


Existing methods for AL include heuristics such as select- 
ing the data points about which the model is most uncer- 
tain (i.e., uncertainty sampling) [15, 26, 31, 24], picking 
the instances about which a set of possible different mod- 
els disagree the most (i.e., query by committee) [23, 10], or 
choosing the example that can lead to the most immediate 
improvement in model performance (i.e., estimated error re- 
duction) [22, 12]. 


In addition to these heuristics for AL, recent studies [30, 
19, 9] have explored how to use reinforcement learning (RL) 
to learn the AL strategy itself. RL [20, 25] is a powerful 
framework where an agent learns how to make good decisions
(actions) in different situations (states) through trial
and error. In the RL terminology, the action space provides 
the set of actions that can be taken by the RL agent at a 
given point in time; the state space defines the “state of the 
world” that is visible to the RL agent; and the reward func- 
tion assigns a value to the outcome of each action taken by 
the RL agent. In this case, the set of possible instances to 
be labeled defines the action space; the state space is a rep- 
resentation of the sequence of instances that have already 
been annotated; and the gain in prediction accuracy as a re- 
sult of an action defines the reward. The RL agent learns to 
improve its decision-making over time based on the reward 
signals it receives. Inspired by these studies, we investigate 
using RL to learn a QP for personalized student assessment. 


1.1 Our approach and contributions 

In this paper, we formalize the problem of learning a QP for 
inferring the student knowledge state and present several 
different approaches including simple heuristics and an RL- 
based approach. Our contributions are: 


• We formulate the problem of learning a QP to infer student
  knowledge.


• We propose simple heuristics (i.e., random selection and
  uncertainty sampling) and an RL-based approach for learning
  a QP.


• We evaluate the performance of different QPs on a synthetic
  dataset and a publicly available dataset consisting of over
  1.5 million examples of students’ answers to mathematics
  questions from Eedi.


To support reproducibility and facilitate further research in
this area, the code and dataset are publicly available.¹


1.2 Related work 


AL is a popular methodology in machine learning that aims 
to reduce the amount of annotated data needed by hav- 
ing the model carefully select which data points should be 
labeled. The task of designing a QP is closely related to 
AL because the goal is to optimally select a set of ques- 
tions to ask students to gain the most information about 
their knowledge states. Uncertainty sampling [15, 26, 31, 
24] is one of the most popular heuristics for AL because it is 
straightforward and computationally efficient. Specifically, 
it suggests labeling instances that are closest to the model’s 
decision boundary (i.e., the most uncertain). Woodward and 
Finn [30] propose the first application of RL to the task of 
AL for image classification. Other studies [19, 9] explore 
how to train an AL policy that can generalize across diverse 
datasets. 


RL has also been applied to various tasks in education such 
as learning an instructional policy [2, 3, 5, 13, 17, 21, 28], 
learning a hint policy for helping students solve multi-step 
problems [7], and generating new educational tasks [1]. We 
introduce a different policy, a quizzing policy for inferring 
the student knowledge state, which has not been designed 
using RL in previous literature. 


¹ https://github.com/joyheyueya/quizzing-policy
Figure 1: Graphical representation of knowledge. This is
an example of an undirected graph where each node (circle)
represents a concept, and each edge connects a pair of similar
concepts: $c_1$ is an independent concept, $c_2$ and $c_3$ are
similar, and $c_4$, $c_5$, and $c_6$ are similar.


There is prior work on the efficient assessment of knowledge 
[8]. Our student knowledge model is inspired by the knowl- 
edge components (i.e., concepts / skills) used in Bayesian 
Knowledge Tracing (BKT) [4], which represents the state 
for each knowledge component as a binary variable: 1 if the 
knowledge component is known, 0 otherwise. 


2. PROBLEM FORMULATION 


In this section, we formalize the problem of learning a quizzing 
policy (QP) for inferring the student knowledge state. 


2.1 Student knowledge state 


Our goal is to infer student knowledge on a set of $n$ concepts
$C = \{c_1, \ldots, c_n\}$ associated with a set of $m$ questions
$X = \{x_1, \ldots, x_m\}$. For simplicity, each question
corresponds to a single concept, but each concept might be
associated with more than one question ($m \gg n$). A student’s
knowledge state $h$ is defined as $h = [v_1, \ldots, v_n]$, where
$v_1, \ldots, v_n$ are binary variables that indicate whether or
not the student knows each concept in $C$: $v_i = 1$ if $c_i$ is
known, and $v_i = 0$ otherwise. Formally, we define a hypothesis
space $\mathcal{H}$ of all possible knowledge states:
$\mathcal{H} = \{0, 1\}^n$. We assume $h$ is fixed during the
assessment.


2.2 Graphical representation of knowledge 
We consider two assumptions that are useful for inferring 
the student knowledge state: 1) difficult concepts are more 
likely to be unknown, and easy concepts are more likely to 
be known; 2) similar concepts are more likely to have the 
same value (i.e., a student who knows one concept is also 
likely to know the other concepts that are similar to the one 
that is already known). These influences can be represented 
by an undirected graph where each node corresponds to a 
concept, and each edge connects a pair of concepts that are 
similar (see Figure 1). In the Eedi dataset (described in 
Section 4.2.1), we consider every pair of concepts that share 
the same super-concept to be similar (e.g., there is an edge 
between “Rearranging Formula and Equations” and “Substi- 
tution into Formula” because they are both under the same 
super-concept “Formula”). Based on this graphical struc- 
ture, we model a student’s knowledge state using a Markov 
Random Field (MRF). 


An MRF is a probability distribution over a set of variables
that satisfy certain properties defined by an undirected
graph. In our case, we define a probability distribution $p$
over binary variables $v_1, \ldots, v_n$ defined by an undirected
graph $G = (V \cup F, E)$, where $V$ is the set of nodes
(concepts), $F$ is the set of factors that define a set of
functions over the variables that they are connected with, and
$E$ is the set of edges (see Figure 2).


An MRF allows us to calculate the probability of each way of
assigning values to the binary variables $v_1, \ldots, v_n$, which
represent the knowledge state of the corresponding concepts
$c_1, \ldots, c_n$. The probability $p$ has the form:

$$p(v_1, \ldots, v_n) = \frac{1}{Z} \prod_{a \in F} \psi_a(v_a) \qquad (1)$$

where $a$ represents a subgraph of $G$, and $\psi_a$ denotes a
factor that defines a non-negative function over the set of
variables $v_a$ in $a$. $Z$ is a normalizing constant that ensures
the distribution sums to one:

$$Z = \sum_{v_1, \ldots, v_n} \prod_{a \in F} \psi_a(v_a) \qquad (2)$$


We specify factors for an MRF based on two assumptions
about the variables $v_1, \ldots, v_n$. For our first assumption
that difficult concepts have a higher probability of being
unknown, we define unary factors:

$$\psi_i(v_i) = \begin{cases} 1 - \text{difficulty}_{c_i} & \text{if } v_i = 1 \\ \text{difficulty}_{c_i} & \text{otherwise} \end{cases}$$

where $\text{difficulty}_{c_i}$ is a real number that represents
the difficulty of the concept $c_i$ that $v_i$ corresponds to, and
$0 < \text{difficulty}_{c_i} < 1$. A higher
$\text{difficulty}_{c_i}$ value means $c_i$ is more difficult.


For our second assumption that variables corresponding to
similar concepts are more likely to have the same values,
we define binary factors between every pair of nodes
$(v_i, v_j)$ that are connected by an edge in graph $G$:

$$\psi_{i,j}(v_i, v_j) = \begin{cases} \text{influence} & \text{if } v_i = v_j \\ 1 - \text{influence} & \text{otherwise} \end{cases}$$

where influence represents a constant that satisfies
$0.5 < \text{influence} < 1$. A greater influence value means we
want to assign a higher probability to an assignment that gives
the same values to variables corresponding to similar concepts.
In our work, we fix influence to be 0.7. We also tried similar
values, and they led to similar results.
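
As a rough illustration of Equations 1 and 2 with these factors, the
sketch below enumerates every assignment for a six-concept graph shaped
like Figure 1 and normalizes by brute force. The difficulty values and
the edge list are assumptions made up for illustration, and exhaustive
enumeration is only feasible for a handful of concepts; it is not the
LBP procedure used for inference later in the paper.

```python
from itertools import product

INFLUENCE = 0.7  # value fixed in the text

# Hypothetical difficulties for six concepts (illustrative only).
difficulty = [0.5, 0.3, 0.3, 0.6, 0.6, 0.6]
# Assumed edges (0-indexed) for a Figure-1-like graph:
# c2-c3, plus every pair among c4, c5, c6. c1 has no edges.
edges = [(1, 2), (3, 4), (3, 5), (4, 5)]


def unnormalized(v):
    """Product of all unary and binary factors for one assignment v."""
    score = 1.0
    for i, v_i in enumerate(v):
        score *= (1.0 - difficulty[i]) if v_i == 1 else difficulty[i]
    for i, j in edges:
        score *= INFLUENCE if v[i] == v[j] else (1.0 - INFLUENCE)
    return score


def joint_distribution():
    """Return {assignment: probability} by exhaustive enumeration."""
    states = list(product([0, 1], repeat=len(difficulty)))
    scores = [unnormalized(v) for v in states]
    z = sum(scores)  # normalizing constant Z from Eq. 2
    return {v: s / z for v, s in zip(states, scores)}


def marginals(dist):
    """Probability that each concept is known, p(v_i = 1)."""
    n = len(difficulty)
    return [sum(p for v, p in dist.items() if v[i] == 1) for i in range(n)]
```

In this toy setting, an easy concept that is connected to another easy
concept ends up with a marginal probability of being known that is
higher than its unary factor alone would give, which is the effect the
binary factors are meant to produce.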


2.3 Quizzing policy for knowledge inference 
Since there is a cost associated with each question we ask a
student (e.g., time, the student’s energy), we need to select a
limited number of questions that reveal the most about their
knowledge state. Thus, student knowledge prediction can
be framed as a pool-based active learning (AL) task with a
given query budget $T$. For simplicity, we assume querying
each exercise leads to the same cost and define $T$ to be the
total number of queries we are allowed.


We describe the AL framework in detail in Algorithm 1.
At a given time step $t$, we have a labelled set $L$ that consists
of all the questions we have asked the student and their
responses. Formally, $L = \{(x^i, y^i)\}_{i=1}^{t}$, where
$x^i \in X$ and $y^i \in \{0, 1\}$ is the student’s response to
$x^i$ ($y^i = 0$ if the response is incorrect, $y^i = 1$ if the
response is correct). We also have an unlabelled set $U$
consisting of all the questions that we have not asked. Based on
$L$, we have a belief $B_h$ about the student’s knowledge $h$.
Formally, $B_h = [b_1, \ldots, b_n]$, where $b_i$ is the
probability of knowing the concept $c_i$ (i.e., $v_i = 1$ with a
probability of $b_i$). We define $\mathrm{Binary}(B_h)$ as a
function that converts probabilities into binary values using a
threshold of 0.5 (1 if $b_i > 0.5$ and 0 otherwise).
$\mathrm{Binary}(B_h)$ gives the inferred binary knowledge state.
We update $B_h$ based on $L$ by running the Loopy Belief
Propagation algorithm (LBP) [11] on our graph defined in Section
2.2. LBP takes $L$ as input and outputs the probabilities
$b_1, \ldots, b_n$ ($0 < b_i < 1$). Additionally, we have a QP
that takes $B_h$ as input and outputs the next question to ask
the student. Specifically, a policy $\pi(\cdot \mid B_h)$ provides
a probability distribution with support over all questions in $U$
given $B_h$. We can then sample a question from
$\pi(\cdot \mid B_h)$.


Figure 2: Modeling graphical student knowledge using an
MRF. This models the knowledge representation in Figure
1 as a factor graph. Each node $v_i$ is a binary variable that
represents the knowledge state of the corresponding concept
$c_i$. Factors are represented by rectangles. There is a unary
factor for every node and a binary factor between every pair
of nodes connected by an edge to model the dependency
between variables.


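Algorithm 1: Active learning for inferring knowledge
Input: budget $T$, quizzing policy $\pi$
Output: $\hat{h}$
Initialize $L_0 \leftarrow \emptyset$, $U_0 \leftarrow \{x_i\}_{i=1}^{m}$
for $t = 1, 2, 3, \ldots, T$ do
    $B_{h_t} = \mathrm{LBP}(L_{t-1})$
    $x^t \sim \pi(\cdot \mid B_{h_t})$
    $L_t \leftarrow L_{t-1} \cup \{(x^t, y^t)\}$
    $U_t \leftarrow U_{t-1} \setminus \{x^t\}$
end
$\hat{h} \leftarrow \mathrm{Binary}(B_{h_T})$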


Algorithm 1 runs as follows: at each time step $t$, we first
get our current belief $B_{h_t}$ based on the previously labelled
set $L_{t-1}$ (i.e., the set of all the questions we have asked the
student before time step $t$ and their responses). We then
select a question $x^t$ from the previously unlabelled set
$U_{t-1}$ to ask the student by sampling from
$\pi(\cdot \mid B_{h_t})$, which defines a probability distribution
with support over all questions in $U_{t-1}$ given $B_{h_t}$. Then,
we update $U_{t-1}$ to $U_t$ by removing $x^t$ from $U_{t-1}$ and
update $L_{t-1}$ to $L_t$ by adding $x^t$ and its label $y^t$ to
$L_{t-1}$. The quizzing process terminates when the query
budget is exhausted. In this work, we fix $T = 10$ as required
by the NeurIPS 2020 Education Challenge [27]. The final
output of the algorithm $\hat{h}$ is the student’s knowledge state
at time step $T$, which is inferred based on $B_{h_T}$.
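
The loop in Algorithm 1 can be sketched in Python as follows. The
belief update, policy, and student used here are hypothetical stand-ins
(a toy clamping rule rather than LBP, a uniform random policy, and a
noise-free simulated student), chosen only so that the loop runs end to
end. The loop mirrors the algorithm line by line, so the belief at step
$t$ is computed from the responses collected before step $t$.

```python
import random


def binary(belief, threshold=0.5):
    """Binary(B_h): 1 if b_i > threshold, else 0."""
    return [1 if b > threshold else 0 for b in belief]


def run_quiz(questions, update_belief, policy, ask, budget):
    """Follow Algorithm 1; return (inferred state, labelled set)."""
    labelled = []                 # L_0 is empty
    unlabelled = list(questions)  # U_0 holds every question
    belief = update_belief(labelled)
    for _ in range(budget):
        belief = update_belief(labelled)       # B_{h_t} from L_{t-1}
        question = policy(belief, unlabelled)  # x^t sampled from the policy
        response = ask(question)               # y^t in {0, 1}
        labelled.append((question, response))  # L_t
        unlabelled.remove(question)            # U_t
    return binary(belief), labelled


# Hypothetical stand-ins (assumptions, not the components from the paper).
N_CONCEPTS = 3
CONCEPT_OF = {q: q % N_CONCEPTS for q in range(9)}  # question -> concept
TRUE_STATE = [1, 0, 1]


def clamp_belief(labelled):
    """Toy update: 0.5 by default, 0.9 or 0.1 once a concept is observed."""
    belief = [0.5] * N_CONCEPTS
    for q, y in labelled:
        belief[CONCEPT_OF[q]] = 0.9 if y == 1 else 0.1
    return belief


def random_policy(belief, unlabelled):
    return random.choice(unlabelled)


def simulated_student(q):
    return TRUE_STATE[CONCEPT_OF[q]]
```

Calling `run_quiz(range(9), clamp_belief, random_policy,
simulated_student, budget=4)` asks four distinct questions and returns a
binary state of length three. Swapping in a real inference routine only
requires a different `update_belief` callable.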


2.4 Evaluation 

We evaluate our QPs using two methods. First, we create a
synthetic dataset consisting of simulated students (see Section
4.1). We predict each student’s knowledge state using
Algorithm 1. Given a prediction result $\hat{h}$, we calculate
the prediction accuracy using the following equation:

$$\mathrm{Acc}(\hat{h}) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\big(\hat{h}[i] = h^*[i]\big) \qquad (3)$$

where $h^*$ is the actual knowledge state.


Second, we apply our QPs to the NeurIPS 2020 Education
Challenge (see Section 4.2). The challenge is to obtain a
limited set of answers from each student for predicting the
correctness of their answers to the remaining questions. Our
approach to this challenge is to first infer a student’s
knowledge state using Algorithm 1 and then predict the student’s
responses to the remaining questions based on the inferred
knowledge state. Specifically, we design an additional model
(see Section 4.2.2) that takes in our belief about the student’s
knowledge state at the final time step $B_{h_T}$ and outputs
the student’s response to each of the $m$ questions. Formally,
the vector $\hat{y} \in \mathbb{R}^m$ denotes the output of the
model. We calculate the prediction accuracy as:

$$\mathrm{Acc}(\hat{y}) = \frac{1}{|U_T|} \sum_{x_i \in U_T} \mathbb{1}\big(\hat{y}[i] = y^*[i]\big) \qquad (4)$$

where $U_T$ is the set of unlabelled questions at the final time
step (unseen by the model), $y^*[i]$ is the student’s actual
response to $x_i$, and $\hat{y}[i]$ is the predicted response.
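
The two accuracy measures (Equations 3 and 4) amount to a few lines of
code. The sketch below is a minimal version under two assumptions: the
held-out accuracy is averaged over the unlabelled questions, and every
index in `unlabelled` has a recorded true response. The function and
argument names are illustrative.

```python
def knowledge_accuracy(h_pred, h_true):
    """Eq. 3: fraction of concepts whose predicted state matches the truth."""
    if len(h_pred) != len(h_true):
        raise ValueError("states must have the same length")
    return sum(int(p == t) for p, t in zip(h_pred, h_true)) / len(h_true)


def heldout_accuracy(y_pred, y_true, unlabelled):
    """Eq. 4: accuracy over question indices never shown to the student."""
    unlabelled = list(unlabelled)
    hits = sum(int(y_pred[i] == y_true[i]) for i in unlabelled)
    return hits / len(unlabelled)
```

For example, `knowledge_accuracy([1, 0, 1, 1], [1, 1, 1, 0])` compares
four concepts and counts two matches.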


3. DESIGNING QUIZZING POLICIES 


In this section, we present heuristics-based approaches and
a reinforcement learning (RL)-based approach to designing
a quizzing policy (QP) that takes in a belief $B_h$ about a
student’s knowledge state and outputs the next question to
ask the student.


3.1 Heuristic approaches 
We present two simple heuristics for designing a QP: random
selection (QP-RANDOM) and uncertainty sampling
(QP-UNCERTAIN). QP-RANDOM is straightforward: we always
randomly select a question from the unlabelled set $U$ (i.e.,
$\pi(a \mid B_h) = \frac{1}{|U|}$ for each $a \in U$).
QP-UNCERTAIN suggests picking a question corresponding to a
concept that our current model is most uncertain about (i.e.,
the concept with a probability of being known that is closest to
0.5). Formally, we define:

$$b^* = \arg\min_{b_i \in B_h} |b_i - 0.5|$$

We first pick a concept $c^*$ with a probability of being known
that is equal to $b^*$. We break ties randomly. We define
$U_{c^*}$ as the set of questions that have not been asked and are
associated with $c^*$. We then define the policy:

$$\pi(a \mid B_h) = \begin{cases} \frac{1}{|U_{c^*}|} & \text{if } a \in U_{c^*} \\ 0 & \text{otherwise} \end{cases}$$
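
Both heuristics are short enough to sketch directly. In the code below,
`concept_of` (a mapping from question to concept index) and the `rng`
argument are illustrative names, and restricting QP-UNCERTAIN to
concepts that still have unasked questions is an assumption about how
an exhausted concept would be handled.

```python
import random


def qp_random(belief, unlabelled, concept_of, rng=random):
    """QP-RANDOM: uniform choice over all unasked questions."""
    return rng.choice(sorted(unlabelled))


def qp_uncertain(belief, unlabelled, concept_of, rng=random, tol=1e-12):
    """QP-UNCERTAIN: question from the concept with belief closest to 0.5."""
    # Group unasked questions by concept; only these concepts are candidates.
    by_concept = {}
    for q in sorted(unlabelled):
        by_concept.setdefault(concept_of[q], []).append(q)
    best = min(abs(belief[c] - 0.5) for c in by_concept)
    tied = [c for c in by_concept if abs(belief[c] - 0.5) - best <= tol]
    concept = rng.choice(tied)              # break ties randomly
    return rng.choice(by_concept[concept])  # uniform over U_{c*}
```

With `belief = [0.9, 0.55, 0.1]`, the second concept is the most
uncertain, so `qp_uncertain` only ever returns a question mapped to that
concept while one remains unasked.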


Figure 3: QP-RL approach. (Diagram: LBP → $B_{h_t}$ → $\theta$ →
$\pi_\theta(a \mid s_t)$, with $a_t = x^t$.)


3.2 RL-based approach 


We now propose an RL-based approach (QP-RL) for learning
a QP. An RL agent learns how to make good decisions
over time by interacting with an environment that is typically
modeled as a Markov Decision Process (MDP). In our
problem setting, we define the MDP $M = (S, A, P, R, s_0)$ as
follows:


• The state space $S$ is the set of beliefs $B_h$ about student
  knowledge (i.e., $S = \{[b_1, \ldots, b_n] \mid 0 < b_i < 1\}$);


• The action space $A$ is the set of questions that have not
  been asked;


• The transition dynamics $P : S \times A \times S \to \mathbb{R}$
  define the probability of transitioning from one state to
  another by taking a particular action. In our case, we
  transition to state $s_{t+1}$ from $s_t$ based on the student’s
  response $y^t$;


• The reward function $R : S \times A \times S \to \mathbb{R}$ is
  defined as the difference in prediction accuracy between the
  current time step and the previous step: for predicting student
  knowledge, given the inferred knowledge state $\hat{h}_{t+1}$
  after taking action $a_t$, we calculate the reward for time step
  $t$ as $\mathrm{Acc}(\hat{h}_{t+1}) - \mathrm{Acc}(\hat{h}_t)$;


• The initial state $s_0$ corresponds to the initial belief about
  student knowledge: each concept has a 0.5 probability of
  being known.


Figure 3 shows an overview of the QP-RL approach. For
training the RL agent, we consider an episodic, finite-horizon
setting. During each episode, we train on one student’s data,
and the length of the episode is the query budget $T$. At each
time step $t$, we run the LBP algorithm that takes in the
student’s response history
$L_{t-1} = \{(x^i, y^i)\}_{i=1}^{t-1}$ to update our belief about
the student’s knowledge state $B_{h_t}$. Then, the RL model,
which is a neural network with parameters $\theta$, takes
$B_{h_t}$ as input (i.e., $s_t = B_{h_t}$) and outputs a vector
$p_c \in \mathbb{R}^n$, which represents the probability of
selecting a question corresponding to each of the $n$ concepts.
We first select a concept $c_i$ by sampling based on $p_c$ and
then randomly select one question from $U_{c_i}$ (the set of
questions that have not been asked and are associated with
$c_i$). We then define the final policy parametrized by $\theta$:
$\pi_\theta(a \mid B_{h_t}) = \frac{p_c[i]}{|U_{c_i}|}$ for
$c_i \in C$ and $a \in U_{c_i}$. Our policy
$\pi_\theta(a \mid B_{h_t})$ allows us to select the next question
to query and add the next question-response pair $(x^t, y^t)$ to
the response history. We then update our belief to
$B_{h_{t+1}}$ based on the updated response history using the LBP
algorithm. We calculate the reward for the current time step as
$r_t = \mathrm{Acc}(\mathrm{Binary}(B_{h_{t+1}})) - \mathrm{Acc}(\mathrm{Binary}(B_{h_t}))$.

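
The two-stage selection (concept first, then a question within that
concept) can be sketched as below. The names `p_c`, `unlabelled`, and
`concept_of` are illustrative. The text does not say what happens when
the sampled concept has no unasked questions left, so this sketch
assumes such concepts are given zero mass and the remaining
probabilities are renormalized.

```python
import random


def _group_by_concept(unlabelled, concept_of):
    """Map each concept index to its list of unasked questions."""
    by_concept = {}
    for q in sorted(unlabelled):
        by_concept.setdefault(concept_of[q], []).append(q)
    return by_concept


def sample_question(p_c, unlabelled, concept_of, rng=random):
    """Sample a concept from p_c, then a uniform unasked question from it."""
    by_concept = _group_by_concept(unlabelled, concept_of)
    concepts = sorted(by_concept)  # assumption: exhausted concepts are masked
    weights = [p_c[c] for c in concepts]
    concept = rng.choices(concepts, weights=weights, k=1)[0]
    return rng.choice(by_concept[concept])


def question_probability(question, p_c, unlabelled, concept_of):
    """Probability of picking `question` under the same masking assumption."""
    by_concept = _group_by_concept(unlabelled, concept_of)
    mass = sum(p_c[c] for c in by_concept)  # total mass of available concepts
    c = concept_of[question]
    return p_c[c] / mass / len(by_concept[c])


def step_reward(acc_after, acc_before):
    """Reward for one step: change in accuracy caused by the new answer."""
    return acc_after - acc_before
```

When every concept still has unasked questions, `question_probability`
reduces to the concept probability divided by the number of unasked
questions for that concept.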
We use the REINFORCE policy gradient method [25, 29] to
learn our policy $\pi_\theta$ parametrized by $\theta$. In each
episode corresponding to a single student, the RL agent performs
an update as follows. First, an initial state $s_0$ (the initial
belief that each concept has a 0.5 probability of being known by
the student) is generated. Then, the policy $\pi_\theta$ is
executed until the episode ends, generating a sequence of
experience given by $(s_t, a_t, r_t)_{t=1,2,\ldots,T}$. Then, in
this episode, for each $t \in \{1, 2, \ldots, T\}$, we use the
following gradient update with $\eta$ as the learning rate:

$$\theta \leftarrow \theta + \eta \cdot \underbrace{\Big(\sum_{\tau=t}^{T} r_\tau\Big) \cdot \nabla_\theta \log \pi_\theta(a_t \mid s_t)}_{\text{gradient at time step } t \text{ in an episode}} \qquad (5)$$
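
To make the update in Equation 5 concrete, the sketch below computes
the reward-to-go sum and applies the update to a deliberately simplified
policy: a single vector of logits passed through a softmax, with no
dependence on the state. That simplification is an assumption made to
keep the gradient in closed form; the policy in this paper is a neural
network over beliefs, trained with an optimizer rather than the plain
gradient step shown here.

```python
import math


def rewards_to_go(rewards):
    """For each t, the sum of rewards from step t to the end of the episode."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running += r
        out.append(running)
    return out[::-1]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_update(theta, actions, rewards, lr):
    """Apply the Eq. 5 update once per time step to softmax logits theta."""
    theta = list(theta)
    for a_t, g_t in zip(actions, rewards_to_go(rewards)):
        probs = softmax(theta)
        for k in range(len(theta)):
            # d/d theta_k of log softmax(theta)[a_t] = 1[k == a_t] - probs[k]
            grad_k = (1.0 if k == a_t else 0.0) - probs[k]
            theta[k] += lr * g_t * grad_k
    return theta
```

An action followed by positive cumulative reward has its logit raised
relative to the others, which is the behavior the update is designed to
produce.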


In experiments, we use the architecture used in [7]. Specifically,
the policy network is a 3-layer fully connected neural
network with the following architecture: the input layer has
$n = 57$ units for $B_h$; the first and second hidden layers
have 128 hidden units; and the output is a vector
$p_c \in \mathbb{R}^n$ where $n = 57$ to produce a probability of
selecting each of the 57 concepts. The first two hidden layers use
ReLU activations, and the final layer uses the softmax function to
ensure probabilities sum to 1. We use the Adam optimizer [14]
for training.
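
A forward pass through a network of this shape can be sketched with
the standard library alone. The layer sizes follow the description
above; the weight initialization, the zero biases, and the helper names
are assumptions, and training is omitted.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Uniform random weights and zero biases (initialization is assumed)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, layer):
    weights, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def policy_forward(belief, layers):
    """Map a belief vector (length 57) to concept probabilities (length 57)."""
    h1 = relu(linear(belief, layers[0]))
    h2 = relu(linear(h1, layers[1]))
    return softmax(linear(h2, layers[2]))


rng = random.Random(0)
layers = [init_layer(57, 128, rng),
          init_layer(128, 128, rng),
          init_layer(128, 57, rng)]
p_c = policy_forward([0.5] * 57, layers)  # initial belief: 0.5 everywhere
```

The output `p_c` is a valid probability vector over the 57 concepts, so
it can be fed directly to a concept sampler.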


4. EXPERIMENTAL EVALUATION 


We first evaluate and compare our quizzing policies (QPs) 
using a synthetic dataset. We then apply our QPs to the 
Eedi dataset from the NeurIPS 2020 Education Challenge. 


4.1 Simulations 

We simulate virtual students taking the assessment quiz and 
test how well we can predict students’ knowledge states in 
a controlled setting using different QPs. 


4.1.1 The synthetic dataset 

We generate a dataset consisting of 24,000 simulated student
knowledge states. To do so, we first construct a graph
for representing the student knowledge state that we aim
to infer (see Section 2.2) and then get a probability
distribution over the binary variables in the knowledge state that
satisfies a set of assumptions about the student’s knowledge.
We then sample ground-truth student knowledge state values
from the probability distribution. In this simulation, we
use the same 57 concepts in the Eedi dataset (described in
Section 4.2.1) for constructing the graph. We assume some
of these concepts have different levels of difficulty, and similar
concepts are more likely to be assigned the same knowledge
state values.² Based on these assumptions, we assign
a value of difficulty to each of the 57 concepts. We define
$\text{difficulty}_c = 1 -$ the average correctness of the concept
$c$ across all students’ answers in the Eedi dataset.³ We run
the LBP algorithm on the constructed graph to get a probability
distribution from which we sample student knowledge
states. Specifically, the output of the LBP algorithm gives
the probability of knowing each concept, and we sample values
of 0 or 1 for each concept to generate the ground-truth
student knowledge states in our synthetic dataset.


² Although our assumptions might not hold in a real-world
setting, the goal of this experiment is to compare different
QPs and investigate the potential of QP-RL for learning
a strategy tailored to a pre-defined knowledge structure.
For instance, compared to the heuristic approach
QP-UNCERTAIN, QP-RL should learn to select the questions
that are not only uncertain but can also give more information
about other questions that are not selected (e.g.,
selecting questions corresponding to concepts that are
connected with many other concepts).


Table 1: Test performance of different QPs on the synthetic
dataset. QP-UNCERTAIN achieves a better performance than
QP-RANDOM, and QP-RL improves over QP-UNCERTAIN
significantly.

QP           | Accuracy
QP-RL        | 0.721 ± 0.004
QP-UNCERTAIN | 0.700 ± 0.002
QP-RANDOM    | 0.675 ± 0.003


[Figure 4 plot: cumulative average accuracy (y-axis) versus number
of episodes (x-axis, 0 to 20,000) for QP-RL, QP-UNCERTAIN, and
QP-RANDOM.]

Figure 4: Training performance of QP-RL on the synthetic
dataset compared to heuristics. QP-RL improves over
QP-RANDOM and QP-UNCERTAIN after about 6,000 episodes of
training. The cumulative average accuracy at each episode
is calculated as the average accuracy across all previous
episodes. It is important to note that QP-RANDOM and
QP-UNCERTAIN are fixed policies that are not being trained.
The cumulative average accuracy for the first few episodes
might seem noisy due to the small sample size.
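
The two data-preparation steps described in this subsection (turning
average correctness into a difficulty value and sampling binary
knowledge states from per-concept probabilities) can be sketched as
follows. The input format (a list of `(question, is_correct)` pairs),
the fallback value for concepts with no answers, and the function names
are assumptions for illustration; the per-concept probabilities would
come from LBP, which is not implemented here.

```python
import random


def concept_difficulty(responses, concept_of, n_concepts):
    """difficulty_c = 1 - average correctness over answers for concept c."""
    correct = [0] * n_concepts
    total = [0] * n_concepts
    for question, is_correct in responses:
        c = concept_of[question]
        correct[c] += is_correct
        total[c] += 1
    # Concepts without any answers default to 0.5 (an assumed fallback).
    return [1.0 - correct[c] / total[c] if total[c] else 0.5
            for c in range(n_concepts)]


def sample_knowledge_state(marginals, rng):
    """Draw each concept as known (1) with its own probability, else 0."""
    return [1 if rng.random() < b else 0 for b in marginals]


def make_dataset(marginals, n_students, seed=0):
    """Generate n_students simulated knowledge states."""
    rng = random.Random(seed)
    return [sample_knowledge_state(marginals, rng) for _ in range(n_students)]
```

With enough simulated students, the fraction who know a concept in the
generated dataset approaches that concept's probability.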


4.1.2 Results 

We split the dataset into 23,000 students as the training
set and 1,000 students as the test set. We train QP-RL
until the cumulative average accuracy converges. Figure
4 shows the training performance of QP-RL compared to the
fixed heuristics. After training, we run each QP 10 times
on the test set to calculate the average accuracy and standard
deviation across these 10 trials (see Table 1). Although
QP-RL yields a gain of about 2 percentage points in accuracy
over QP-UNCERTAIN (0.721 vs. 0.700), it requires a moderate
amount of training data (> 6,000 students in this case).
QP-UNCERTAIN is a less effective strategy but achieves
reasonably good performance without any training data. These
results provide initial evidence that QP-RL can learn an
effective QP, and the performance may improve further with
more data.


³ For simulations, one could also try other difficulty values,
but it does not matter which specific difficulty value we assign
to each concept because the goal is to model a setting
where we have concepts of varying levels of difficulty.

[Figure 5 image: a sample question reading “What is the size of the
obtuse angle AOC?”]

Figure 5: An example of a question in the Eedi dataset
[27]. For each multiple-choice question, exactly one choice
is correct.


4.2 NeurIPS 2020 Education Challenge 

We then apply our QPs to one of the tasks in the NeurIPS
2020 Education Challenge (see Section 2.4), which is to obtain
a limited set of answers from each student for predicting
the correctness of their answers to the remaining questions.
Our approach to this challenge is to first infer a student’s
knowledge state using Algorithm 1 and then predict the
student’s responses to the remaining questions based on the
inferred knowledge state.


4.2.1 The Eedi dataset 


The Eedi dataset contains student responses to multiple-choice
questions (see Figure 5) on various math topics, collected
between September 2018 and May 2020. It contains 948 questions
and a total of 1,508,917 responses to these questions from
6,148 students. The dataset is split into the training set
(4,918 students), the validation set (615 students), and the
test set (615 students).


Each question in the dataset is associated with a list of sub- 
jects. Each subject covers an area of mathematics. These 
subjects are arranged in a tree structure by experts based 
on the generality of the subjects. For instance, “Fractions” 
is the parent subject of “Multiplying Fractions” and “Simpli- 
fying Fractions”. For simplicity, we only consider the most 
granular subject (i.e., the leaves in the tree) as the concept 
that each question corresponds to. The 948 questions cor- 
respond to 57 unique concepts. We consider concepts that 
share the same super-concept (i.e., parent) to be similar (see 
Figure 1). 
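
Building the similarity edges from the subject tree is a small grouping
step. The sketch below assumes the tree is available as a mapping from
each leaf concept to its parent subject (the `parent_of` dictionary is
a hypothetical input format); the four concept names are the examples
mentioned in this paper.

```python
from itertools import combinations


def similarity_edges(parent_of):
    """Connect every pair of leaf concepts that share the same parent."""
    siblings = {}
    for concept, parent in parent_of.items():
        siblings.setdefault(parent, []).append(concept)
    edges = []
    for group in siblings.values():
        edges.extend(combinations(sorted(group), 2))
    return sorted(edges)


parent_of = {
    "Multiplying Fractions": "Fractions",
    "Simplifying Fractions": "Fractions",
    "Rearranging Formula and Equations": "Formula",
    "Substitution into Formula": "Formula",
}
edges = similarity_edges(parent_of)
```

A parent with k leaf concepts contributes k(k-1)/2 edges, so large
sibling groups become densely connected in the resulting graph.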


4.2.2 Student performance prediction 

To predict a student’s responses to unseen questions based
on the inferred knowledge state, we propose a neural
network-based model that takes in the belief about the student’s
knowledge $B_{h_T}$ at time $T = 10$ (our belief about their
knowledge after we have asked 10 questions) and outputs
the probability of answering each of the 948 questions in
the dataset correctly. The student performance prediction
model is a 3-layer fully connected neural network with the
following architecture: the input layer has $n = 57$ units for
$B_h$; the first hidden layer has 256 hidden units; the second
hidden layer has 512 units; and the output is a vector
$\hat{y} \in \mathbb{R}^m$ where $m = 948$ to represent the
probability of correctness for each of the 948 questions. The
first two hidden layers use ReLU activations, and the final layer
uses the sigmoid function to ensure the output values are between
0 and 1. We use the Adam optimizer [14] for training. We convert
the output probabilities into binary values of 0 or 1 (0 if the
probability is less than 0.5, 1 otherwise) and calculate the
prediction accuracy using Equation 4. We train the model
using randomly selected queries until the validation accuracy
converges. The model parameters are updated based
on binary cross-entropy loss.


Table 2: Test performance of different QPs on the Eedi dataset.
QP-RL improves slightly over QP-UNCERTAIN.

QP           | Accuracy
QP-RL        | 0.690 ± 0.005
QP-UNCERTAIN | 0.680 ± 0.003
QP-RANDOM    | 0.684 ± 0.003
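
The loss and the thresholding step mentioned above can be written out
in a few lines. This is a generic sketch of mean binary cross-entropy
and 0.5-thresholding, not the training code; the clipping constant
`eps` and the function names are assumptions.

```python
import math


def binary_cross_entropy(probs, targets, eps=1e-12):
    """Mean binary cross-entropy between probabilities and 0/1 targets."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(targets)


def to_binary(probs, threshold=0.5):
    """0 if the probability is below the threshold, 1 otherwise."""
    return [0 if p < threshold else 1 for p in probs]
```

A prediction of 0.5 for every question gives a loss of log 2 regardless
of the targets, which is a useful baseline when checking whether
training is making progress.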


4.2.3 Results 

Given a trained performance prediction model from Section
4.2.2, we then train QP-RL using the difference in final
prediction accuracy between time steps as reward signals:
$r_t = \mathrm{Acc}(\mathrm{Binary}(\hat{y}_t)) - \mathrm{Acc}(\mathrm{Binary}(\hat{y}_{t-1}))$.
After training, we run each QP 10 times on the test set to
calculate the average accuracy and standard deviation across
these 10 trials. Table 2 shows that QP-RL improves slightly over
QP-UNCERTAIN, but the difference between QP-RL and
QP-RANDOM is not significant. Results in Section 4.1.2 show
that in a more controlled setting, QP-RL already requires a
moderate amount of training data (> 6,000 students) to improve
over heuristics. However, we only have training data
from about 5,000 students in this experiment. Learning a
QP from noisy real-student data is more challenging, and it may
be the case that improving QP-RL further would require a much
larger dataset. Even though QP-RL seems to require a substantial
amount of training data, this is a one-time training cost, and
the learned policy can be applied to future students.


5. CONCLUSION 


Student assessment is a crucial component of many online 
education systems for improving student learning outcomes. 
Inferring student knowledge state by quizzing poses a tech- 
nical challenge: maximizing accuracy while minimizing the 
quizzing cost. In this paper, we show initial evidence that 
reinforcement learning (RL) provides a potential solution, 
improving over heuristics given sufficient training data. 


There are several research directions for future work. Fur- 
ther gains in accuracy could be achieved by exploring more 
powerful RL techniques and more complex student knowl- 
edge modeling techniques. In this work, we model all con- 
cepts that share the same super-concept as having the same 
relationship; however, there could be prerequisites as well 
as weaker and stronger relationships in reality. It would be 
important to study whether varying the influence values be- 
tween concepts would lead to gains in model performance. 